Since the dawn of time, people have been developing human-like attachments to their transportation. I imagine Moses naming his chariot, Noah his ark, and maybe even Jesus naming his borrowed donkey.
Well, maybe I’m taking too much historical liscence, but man’s affinity for their transportation remains.
At the age of 21, my dad and I split the cost of a 2005 Ford F150. The odometer read ~24,000 miles and it set us back ~$20,000.
I’ve been with this truck for 14 years - longer than I’ve known my wife. Its been with me from Texas to New York to Georgia to Missouri to Hawaii to Missouri to New Mexico to Ohio to New York to Virginia. Tragically, on the drive from New York to Virginia, the engine developed a tremendous rattling noise and a violent shake. Unbeknownst to me, it developed an oil leak sometime in the summer of ’19 and the 70 mph roads were too much for it to handle. Just short of its 200,000th mile, I laid it to rests in Fairfax Station, Virginia. It lived long enough to meet its successor, Ole Grey Truck II.
The following is a (toungue-in-cheek) R style ode to my old friend. Long live Ole Grey!
I’ll explore the following data. I made an earlier version of this blog post here where I looked at some similar, but incomplete, analysis on this data.
I’ll use the following packages to load, manipulate, and view the data.
library(tidyverse)
library(lubridate)
library(skimr)
library(kableExtra)
Below, you’ll notice that the observational unit is each time I filled up the pick-up. The column headers are pretty self explanitory.
To be explicit, here is a description of the data:
FillUpDate: Date I filled up the pick up with gasoline.TotalMiles: Odometer reading when I filled up the tank.TankMiles: Trip odometer reading for the most recent tank of gas. Reset every ‘fill up.’Gallons: Gallons of gasoline put in the tank at time of fill up.MPG: Calculated miles per gallon on the previous tank of gas. TankMikes/GallonsFinishTankDate: Date of previous fill up.Location: is the location that I lived at the time of filling up the tank.head(truck)
## # A tibble: 6 x 7
## FillUpDate TotalMiles TankMiles Gallons MPG FinishTankDate Location
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 17-Feb-06 24251 366 22.6 16.2 18-Feb-06 Texas
## 2 18-Feb-06 24443. 192. 12.8 15.0 18-Feb-06 Texas
## 3 18-Feb-06 24765 322. 16.5 19.5 27-Feb-06 Texas
## 4 27-Feb-06 25156 391 24.9 15.7 10-Mar-06 Texas
## 5 10-Mar-06 25431. 275. 15.8 17.4 11-Mar-06 New York 1
## 6 11-Mar-06 25548. 117. 7.5 15.6 11-Mar-06 New York 1
Thanks to readr, most of the columns read in appropraitly. However, it looks like we’ll need to make an adustment so that the dates are the correct class. Thanks lubridate!
truck =
truck %>%
mutate(FillUpDate = dmy(FillUpDate),
FinishTankDate = dmy(FinishTankDate))
head(truck)
## # A tibble: 6 x 7
## FillUpDate TotalMiles TankMiles Gallons MPG FinishTankDate Location
## <date> <dbl> <dbl> <dbl> <dbl> <date> <chr>
## 1 2006-02-17 24251 366 22.6 16.2 2006-02-18 Texas
## 2 2006-02-18 24443. 192. 12.8 15.0 2006-02-18 Texas
## 3 2006-02-18 24765 322. 16.5 19.5 2006-02-27 Texas
## 4 2006-02-27 25156 391 24.9 15.7 2006-03-10 Texas
## 5 2006-03-10 25431. 275. 15.8 17.4 2006-03-11 New York 1
## 6 2006-03-11 25548. 117. 7.5 15.6 2006-03-11 New York 1
There are a handfull of ‘fun’ descriptive statistics from the data.
summary(truck)
## FillUpDate TotalMiles TankMiles Gallons
## Min. :2006-02-17 Min. : 24251 Min. : 79.4 Min. : 6.06
## 1st Qu.:2009-07-04 1st Qu.: 65499 1st Qu.:333.0 1st Qu.:22.61
## Median :2012-02-17 Median :110550 Median :414.2 Median :27.30
## Mean :2011-12-13 Mean :108423 Mean :392.7 Mean :25.59
## 3rd Qu.:2014-03-05 3rd Qu.:148723 3rd Qu.:459.3 3rd Qu.:29.73
## Max. :2019-07-19 Max. :193497 Max. :799.6 Max. :39.40
## NA's :1
## MPG FinishTankDate Location
## Min. : 8.394 Min. :2006-02-18 Length:432
## 1st Qu.:14.506 1st Qu.:2009-07-08 Class :character
## Median :15.597 Median :2012-02-21 Mode :character
## Mean :15.373 Mean :2011-12-18
## 3rd Qu.:16.503 3rd Qu.:2014-03-08
## Max. :25.834 Max. :2019-07-19
## NA's :1 NA's :1
We see that I more or less owned the truck from Feb of 06 through July of 2019. I filled it up 433 times and drove it unitl it his 193,497 miles…. so close!
The average miles per gallon was 15.6 and I even attained 25ish one time. The least was 8.4. I likely was pulling my wife’s car on a trailer on that tank.
Looks like I usually drove about 333 miles on a tank. I put as few at ~80 and as much as 799. Must have been the time a buddy filled it up and didn’t tell me :).
While it ony has a 34 gallong tank, the max gallons added to the truck was 39.4. This is likely due to times when friends borrowed the truck and put in a few gallons for me and didn’t note it in my log. I compensated for it by adding it to the next fill up.
How many times did this happen?
truck %>%
select(Gallons) %>%
filter(Gallons>34)
## # A tibble: 5 x 1
## Gallons
## <dbl>
## 1 39.4
## 2 34.2
## 3 34.4
## 4 34.2
## 5 34.6
Lets look at the spread of the data a little closer to get a feel for the distribution of gallons per fill up, miles per gallon, and miles per tank of gas.
truck %>%
gather(Measurement, Value, -FillUpDate, -FinishTankDate, -Location) %>%
filter(Measurement != "TotalMiles") %>%
ggplot(aes(x=Value)) +
geom_histogram() +
facet_wrap(~Measurement, scales = "free_x") +
labs(x="", y = "Number of Occurances (Fill Ups)", title = "Frequency of Occurance for Each Fill Up")
It looks like gallons and miles per tank of gas (unsurprisingly) are right scewed as factors have a cap on the upper end (with a few exceptions when friends fill up and don’t tell me). Miles per gallon looks more normally distributed with a few more occurances of fewer miles per gallon.
Lets see how well my truck performed in each location.
library(formattable)
truck %>%
select(TankMiles,Gallons,MPG,Location) %>%
group_by(Location) %>%
summarise_all(.funs = mean, na.rm = TRUE) %>%
mutate_if(is.numeric, round, 2) %>%
mutate(MPG = cell_spec(MPG, "html", color = ifelse(MPG == max(MPG), "red", ifelse(MPG== min(MPG), "blue", "black")))) %>%
mutate(TankMiles = cell_spec(TankMiles, "html", color = ifelse(TankMiles == max(TankMiles), "red", ifelse(TankMiles== min(TankMiles), "blue", "black")))) %>%
mutate(Gallons = cell_spec(Gallons, "html", color = ifelse(Gallons == max(Gallons), "red", ifelse(Gallons== min(Gallons), "blue", "black")))) %>%
kable(format = "html", escape = F, align = "l") %>%
kable_styling("striped", full_width = F)
| Location | TankMiles | Gallons | MPG |
|---|---|---|---|
| Georgia | 377.61 | 21.81 | 17.31 |
| Hawaii | 430.06 | 27.88 | 15.43 |
| Missouri 1 | 387.49 | 25.38 | 15.34 |
| Missouri 2 | 344.99 | 22.42 | 15.51 |
| New Mexico | 374.48 | 24 | 15.55 |
| New York 1 | 333.94 | 20.7 | 16.11 |
| New York 2 | 372.16 | 27.9 | 13.12 |
| Ohio | 444.93 | 29.11 | 15.22 |
| Texas | 317.75 | 19.2 | 16.6 |
| Virginia | 337.6 | 24.9 | 11.02 |
This chart shows highlights some intersting points.
First, I did a ton of highway driving while I lived in Georgia to attain the best miles per gallon.
My top average miles per tank and average gallons per fill up were from Ohio. I would drive my truck till it was almost empty because I could get free 30 gallons of gasoline based off of money spent at our local grocery store.
The worse miles per gallon is Virginia. Well, since I only had two fill ups and they were both after I essentially killed the truck, this is no suprise.
The worse miles per tank and gallons per fill up were in Texas when my dad still possessed the truck (and doesn’t live as close to the edge as I do).
How much time did I spend in each location? How many miles did I drive at each place?
truck %>%
group_by(Location) %>%
summarise(Arrive_Location = min(FillUpDate),
Leave_Location = max(FillUpDate),
MilesDriven = round(max(TotalMiles)-min(TotalMiles))) %>%
arrange(Arrive_Location) %>%
mutate(DaysInLocation = Leave_Location-Arrive_Location,
MilesPerDay = round(MilesDriven/as.numeric(DaysInLocation))) %>%
DT::datatable(options = list(paging = FALSE))
Interesting Notes:
Lets look at this visually.
truck %>%
group_by(Location) %>%
summarise(Arrive_Location = min(FillUpDate),
Leave_Location = max(FillUpDate),
MilesDriven = round(max(TotalMiles)-min(TotalMiles))) %>%
arrange(Arrive_Location) %>%
mutate(DaysInLocation = Leave_Location-Arrive_Location,
MilesPerDay = round(MilesDriven/as.numeric(DaysInLocation)),
DaysInLocation = as.numeric(DaysInLocation)) %>%
gather(Attribute, Measurement, -Location, -Arrive_Location, -Leave_Location) %>%
group_by(Attribute) %>%
mutate(order = rank(Measurement)) %>%
# ggplot(aes(x=Location,y=Measurement)) +
ggplot(aes(x=drlib::reorder_within(Location,-order,Measurement),y=Measurement)) +
geom_col(aes(fill = Location)) +
drlib::scale_x_reordered() +
facet_wrap(~Attribute, scales = "free") +
labs(title = "Miles Driven in Each Location", x = "Location", y = "Miles Driven") +
coord_flip() +
theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
A picture (graph) is worth 1000 words.
truck %>%
ggplot(aes(x=FillUpDate,y=TotalMiles,color = Location) ) +
geom_point() +
geom_line() +
scale_x_date(breaks = "year",labels = scales::date_format("%Y")) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Miles Added Over Time", x = "Fill Up Date", y = "Total Miles")
From this plot, you can see several take aways:
How often did a tank of gas last?
truck %>%
select(FillUpDate,FinishTankDate) %>%
mutate(DaysBetweenFillUps = FinishTankDate-FillUpDate) %>%
group_by(DaysBetweenFillUps)%>%
count() -> FillUps
FillUps %>%
arrange(desc(n)) %>%
ungroup() %>%
top_n(5,n) %>%
kable() %>%
kable_styling("striped", full_width = F)
| DaysBetweenFillUps | n |
|---|---|
| 1 days | 32 |
| 8 days | 32 |
| 6 days | 31 |
| 0 days | 28 |
| 9 days | 28 |
It appears the most often days between fill ups was a tie for both 1 day and 8 days.
Lets Visualize it:
FillUps %>%
ggplot(aes(x=DaysBetweenFillUps,y=n)) +
geom_col() +
labs(x="Days Between Fill Ups", y="Number of Fill Ups", title = "How Many Days Did a Tank of Gas Last?")
FillUps %>%
arrange(desc(DaysBetweenFillUps)) %>%
ungroup() %>%
top_n(5,DaysBetweenFillUps) %>%
kable() %>%
kable_styling("striped", full_width = F)
| DaysBetweenFillUps | n |
|---|---|
| 120 days | 1 |
| 73 days | 1 |
| 66 days | 2 |
| 63 days | 1 |
| 61 days | 2 |
Also of note, the longest time between fill ups was 120 days. This must have been during one of my deployments.
I also wonder if I happen to have filled up the tank more often on one day than another.
truck %>%
mutate(Weekday = weekdays(FillUpDate)) %>%
select(TankMiles,Location,Weekday,FillUpDate) %>%
# select(-FillUpDate,-FinishTankDate) %>%
# gather(Measurement, Value, -Weekday, -Location) %>%
ggplot(aes(x=fct_relevel(Weekday, c("Sunday","Monday","Tuesday","Wednesday","Thursday")))) +
geom_bar() +
labs(x="Day of Week", y = "Number of Fillups", title = "What Day Did I Fill Up The Most?") +
theme(legend.position = "none", axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
The barchart indicates that I might fill up more on Friday than any other day.
Lets look at a few statistical tests to see if Weekday impacts fillups.
fisher =
truck %>%
mutate(Weekday = weekdays(FillUpDate)) %>%
select(Weekday) %>%
count(Weekday, name = "DaysFilledUp") %>%
left_join(
tibble(
date = seq(ymd(min(truck$FillUpDate)), ymd(max(truck$FillUpDate)), by = '1 day')) %>%
mutate(Weekday = weekdays(date)) %>%
count(Weekday, name = "Total")) %>%
mutate(DaysNotFilledUp = Total-DaysFilledUp) %>%
select(-Total)
fisher
## # A tibble: 7 x 3
## Weekday DaysFilledUp DaysNotFilledUp
## <chr> <int> <int>
## 1 Friday 75 626
## 2 Monday 60 640
## 3 Saturday 66 634
## 4 Sunday 56 644
## 5 Thursday 62 638
## 6 Tuesday 59 641
## 7 Wednesday 54 646
The table above shows how many times I filled up with gas on each day of the week and how many days I had the opportunity to fill up and did not. Essentially the question we are asking is: Is there something about Day of the Week which would impact whether or not I filled up with gas? Equivalently, is there something about filling up (or not) that impacts the day of the week? AKA: Are day of the week and fill ups independent?
Lets turn to several statistical tests.
If you are old school: The \(\chi^2\) test.
fisher %>%
select(-Weekday) %>%
chisq.test()
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 5.2405, df = 6, p-value = 0.5134
Since the null hypothesis is that there is no difference between day of week and fill up or not (also known as independence), the P Value is high so we ‘fail to reject’ this. We conclude that there’s no difference between day of week and decision to fill up or not. We could also say that the probability that they came from the same distribution is fairly high (.5134) therefore we do not have much evidence to say that they are different.
If you are new school, you turn to Fisher’s Exact Test. Essentially, if we lock in the amount of ‘fill ups’ over the time I owned the pick up and were to calculate every possible combination of days of which I could fill up with gas (eg, all on Monday, all but one on Monday and one on Tuesday, and so on), how extreme is what we saw. Fisher figured out that we could calculate that exact probabily - and hence we have Fisher’s Exact Test.
fisher %>%
select(-Weekday) %>%
fisher.test(workspace = 1000000)
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.5311
## alternative hypothesis: two.sided
It turns out that there is a 53.11% chance that we’d see what happened or see something more extreme by chance. I dare say that there is no relationship between day of week and the decision to fill up with gas.
The map below shows dates in which I filled up my truck with gasoline. The darker green indicates more gallons – the lighter indicates fewer gallons. The dark dates mean I did not fill up with gasoline on that date.
truck %>%
complete(FillUpDate = seq.Date(min(FillUpDate), max(FillUpDate), by="day")) %>%
mutate(monthweek = week(FillUpDate)) %>%
mutate(Weekday = weekdays(FillUpDate)) %>%
mutate(year = year(FillUpDate)) %>%
mutate(month = month(FillUpDate)) %>%
group_by(month) %>%
mutate(monthweek = monthweek - min(monthweek)) %>%
ggplot(aes(y=monthweek,x=fct_relevel(Weekday, c("Sunday","Monday","Tuesday","Wednesday","Thursday")), fill = Gallons)) +
geom_tile(color = "white") +
facet_grid(year~month, scales = "free") +
scale_fill_gradient(low="white", high="green", na.value = "black") +
labs(x = "Weekday", y = "Week of Month", title = "Time-Series Calendar Heatmap: Amount of Gas Pumped on Each Day", caption = "Days with no gas are denoted in White") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = -0), axis.text.y = element_blank(), axis.ticks.y = element_blank())
I’ll admit, its hard to glean anything from this heatmap…. But I spent a lot of work creating it so I’m going to leave it.
As always, I do this blog to learn new things, hopefully teach others from my experience, and hopefully receive (respectful and helpful) feedback.
Here’s a picture of old and new meeting!
Please leave your thoughts below!!